Abstract.

In recent work we presented a new approach to the analysis of weighted networks, 
by providing a straightforward generalization of any network measure defined on 
unweighted networks. This approach is based on the translation of a weighted 
network into an ensemble of edges, and is particularly suited to the analysis of 
fully connected weighted networks. Here we apply our method to several such 
networks including distance matrices, and show that the clustering coefficient, 
constructed by using the ensemble approach, provides meaningful insights into 
the systems studied. In the particular case of two data sets from microarray 
experiments the clustering coefficient identifies a number of biologically significant 
genes, outperforming existing identification approaches. 

The rise of information technology and the internet, as well as the more recent
advent of high-throughput technologies in biology, has made it easier to obtain large
amounts of data on complex networks. Increasingly this also includes data on weighted
complex networks, which now appear in many different guises: transport and traffic
[1] [2], trade or communication networks, financial networks [3], and collaboration
networks [4], to name a few. In biology, genetic regulation and transcription [5] and
protein interaction [6] have been studied in this context. However, the extraction of
meaningful physical or biological information from these networks is a difficult task.
For unweighted complex networks, with binary adjacency matrices, a set of local and
global measures on the network has been defined [7], including the degree of a node,
its average nearest-neighbour degree [8] and its clustering coefficient [9]. Defining these
measures for weighted networks is more difficult and has been the subject of recent
research [2] 02 [10] [11]. A review of definitions of weighted clustering coefficients can
be found in [12].

In a recent paper [13] we introduced a new approach to this problem which allows
for a straightforward generalization of any measure defined on an unweighted network 
to weighted networks. Here we apply the clustering coefficient defined in this way to 
distance matrices, which are fully connected weighted networks. The distance matrices 
are generated from microarray expression series, so that closely related series (by some 
chosen similarity measure) will be separated by a short distance, which in the network 
picture translates into an edge with a large weight. 

The basis of our approach is to find a continuous bijective map $M : \mathbb{R} \to [0, 1]$
from the real numbers to the interval between 0 and 1, which maps the weights $w_{ij} \in \mathbb{R}$
to a quantity $p_{ij} \in [0, 1]$. A simple example of such a map is a linear normalization of
the weights:

$$p_{ij} = \frac{w_{ij} - \min(w_{ij})}{\max(w_{ij}) - \min(w_{ij})}$$

This simple normalization maps $\min(w_{ij})$ to zero. While this is often acceptable in
the case of a distance matrix, one should make a more sophisticated choice of map
if there are many edges with weight $\min(w_{ij})$. Similarly, if the network has negative
weights as well as positive ones, the normalized modulus of the original weights might
be a more appropriate choice. A more detailed discussion of the choice of map
can be found in [13].
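As an illustration, the linear normalization above can be sketched in a few lines of Python. This is a minimal sketch rather than the implementation used in [13]: the function names `linear_map` and `distances_to_weights` are hypothetical, the diagonal is ignored by assumption, and negating distances is just one simple way to make short distances correspond to large weights.

```python
def distances_to_weights(distances):
    """Turn distances into weights so that short distances give large weights.

    Assumption: w = -d is used here; any decreasing function would also do.
    """
    return [[-d for d in row] for row in distances]


def linear_map(weights):
    """Map a square weight matrix onto [0, 1] by min-max normalization.

    Assumption: diagonal entries (self-loops) are ignored and set to 0.
    """
    n = len(weights)
    off_diag = [weights[i][j] for i in range(n) for j in range(n) if i != j]
    lo, hi = min(off_diag), max(off_diag)
    if hi == lo:
        raise ValueError("all weights are equal; the linear map is undefined")
    return [[0.0 if i == j else (weights[i][j] - lo) / (hi - lo)
             for j in range(n)]
            for i in range(n)]
```

Under this map every edge whose weight equals the minimum receives probability zero, which is the situation in which a more sophisticated choice of map is advisable.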

The ideas we introduce in [13] are based on an interpretation of the matrix $P$
with entries $\{p_{ij}\}$ as a matrix of probabilities. These probabilities can be interpreted
as an ensemble of edges, or more concisely, an ensemble network. Thus, just as any
binary square matrix can be understood as an unweighted network and any real
square matrix corresponds to a weighted network, any square matrix with entries
between 0 and 1 corresponds to an ensemble network. If we sample each edge of the
ensemble network exactly once, we obtain an unweighted network which we term a
realization of the ensemble network. In particular, $p_{ij}$ is the probability that the edge
between nodes $i$ and $j$ exists. These concepts are valid both for directed networks,
with any $p_{ij} \in [0, 1]$, and undirected networks, for which $p_{ij} = p_{ji}$, so that the matrix
is symmetric. In a real-world weighted network, the original weights can represent 
almost any physical quantity, such as the strength of a collaboration between two 
scientists, or the number of passengers traveling between two countries. By mapping 
these weights to probabilities we rid ourselves of the interpretational burden of these 
weights, whilst retaining all the topological information they contain. It should be 
noted that in many cases the interpretation of weights as probabilities also makes
intuitive physical sense. Whenever the weights in a network represent a magnitude of 
flow, this can be interpreted directly in terms of the probability that a transfer occurs 
during a given unit of time. Examples include traffic and transport networks as well 
as communication networks, where we have units (passengers, money, signals) which 
form an edge, through their transfer, with a probability proportional to the flow rate. 
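The notion of a realization can be made concrete with a short Python sketch. It assumes that each edge is sampled independently with its own probability; the function name `sample_realization` and its parameters are hypothetical and chosen only for illustration.

```python
import random


def sample_realization(p, rng, directed=False):
    """Draw one unweighted network (0/1 matrix) from an ensemble network p.

    Each edge is sampled exactly once and independently (an assumption).
    For undirected networks only the upper triangle is sampled and mirrored.
    """
    n = len(p)
    a = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or (not directed and j < i):
                continue
            # rng.random() lies in [0, 1), so p = 1 always gives an edge
            # and p = 0 never does.
            if rng.random() < p[i][j]:
                a[i][j] = 1
                if not directed:
                    a[j][i] = 1
    return a
```

Passing an explicit generator such as `random.Random(seed)` keeps repeated sampling reproducible.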

All measures on unweighted networks can be written as functions of the entries
$a_{ij}$ of an adjacency matrix $A$. In fact, generally they can be written as a polynomial
of these entries, or a simple ratio of such polynomials. Note that, for an unweighted
network, $a_{ij} = a_{ij}^m$ for all positive integers $m$, so that these polynomials are of
first order only. Consider a general first-order polynomial, which can be written fully
expanded as:

$$f(A) = \sum_{q=0}^{2^{N^2}} C_q \prod_{j,k=0}^{N} a_{jk}^{b(q)_{jk}}$$

where $N$ is the number of nodes, the $C_q$ are real coefficients and the $b(q)_{jk}$ are a set
of boolean matrices specifying which adjacency matrix entries appear in each term of
the polynomial. The probability $P_q$ that $\prod_{j,k=0}^{N} a_{jk}^{b(q)_{jk}} = 1$ in a given realization $A$ is
simply $P_q = \prod_{j,k=0}^{N} p_{jk}^{b(q)_{jk}}$, since each edge is sampled independently. Thus, due to the
linearity of the polynomial, the average $\bar{f}(P)$ of $f$ over the ensemble network realizations is:

$$\bar{f}(P) = \sum_{q=0}^{2^{N^2}} C_q \prod_{j,k=0}^{N} p_{jk}^{b(q)_{jk}} = f(P) \tag{2}$$

This means that the value of a polynomial function / of the entries of an unweighted 
network A, averaged over the realizations of a given ensemble network P is equal to 
the value of the polynomial of the ensemble network adjacency matrix itself. 
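This statement can be checked by exact enumeration on a small example. The sketch below assumes three independently sampled edges; the polynomial `f` and the probabilities in `probs` are arbitrary illustrative choices, not quantities taken from the data sets studied here.

```python
from itertools import product


def f(a01, a02, a12):
    """Example first-order polynomial: every entry appears at most to power 1."""
    return 3.0 * a01 * a02 * a12 - 2.0 * a01 + 0.5 * a02 * a12


def ensemble_average(func, probs):
    """Average func over all 0/1 realizations of independently sampled edges."""
    total = 0.0
    for edges in product((0, 1), repeat=len(probs)):
        weight = 1.0
        for a, p in zip(edges, probs):
            weight *= p if a else (1.0 - p)  # probability of this realization
        total += weight * func(*edges)
    return total


probs = (0.9, 0.4, 0.25)               # p01, p02, p12 (illustrative values)
average = ensemble_average(f, probs)   # average over all 2**3 realizations
direct = f(*probs)                     # polynomial evaluated on P itself
```

Up to floating-point rounding, `average` and `direct` agree. The agreement relies on the polynomial being first order in every entry, which is why higher powers must first be reduced using the fact that powers of a binary entry equal the entry itself.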

The degree $k_i$ of a given node $i$ in an unweighted network with adjacency matrix
elements $a_{ij}$ is the number of its neighbours, and is written as $k_i = \sum_j a_{ij}$. In a
weighted network with elements $w_{ij}$ the corresponding quantity has been termed
the strength of the node $i$, denoted as $s_i$, which consists of the sum of the weights:
$s_i = \sum_j w_{ij}$. In an ensemble network, the corresponding sum over the edges attached
to a particular node gives the average degree of node $i$ across realizations, denoted as
$\tilde{k}_i$ and given by $\tilde{k}_i = \sum_j p_{ij}$.

It is important to note that while the strength of a node in a weighted network
may have meaning in the context of the network, $\tilde{k}_i$ has a universal meaning, regardless
of the original meaning of the weights.
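The three quantities share the same functional form, which a short Python sketch makes explicit. The helper `row_sums` and the three small matrices are hypothetical; the probabilities are what a linear normalization would give for these particular weights.

```python
def row_sums(matrix):
    """Sum each row of a square matrix, skipping the diagonal (no self-loops)."""
    n = len(matrix)
    return [sum(matrix[i][j] for j in range(n) if j != i) for i in range(n)]


# The same function yields three different quantities depending on its input.
a = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]            # unweighted adjacency matrix
w = [[0, 12.0, 3.0], [12.0, 0, 0], [3.0, 0, 0]]  # weights in arbitrary units
p = [[0, 1.0, 0.25], [1.0, 0, 0], [0.25, 0, 0]]  # probabilities from a map M

degree = row_sums(a)           # degree of each node
strength = row_sums(w)         # strength of each node
ensemble_degree = row_sums(p)  # average degree across realizations
```

The strength carries the units of the original weights, whereas the ensemble degree is always an expected number of neighbours.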

As a more complex example, consider the clustering coefficient of a node i, which 
has been defined [9] as: 

$$c_i = \frac{\sum_{j<k} a_{ij} a_{jk} a_{ik}}{k_i (k_i - 1)/2} = \frac{\sum_{j,k} a_{ij} a_{jk} a_{ik}}{\sum_{j,k} a_{ij} a_{ik}} \tag{3}$$

where $k \neq j \neq i \neq k$ in the sums. This corresponds to the number of triangles in the
network which include node i, divided by the number of pairs of bonds including i, 
which represent potential triangles. Using the ensemble approach with its normalized 
weights this generalizes straightforwardly to: 

$$\tilde{c}_i = \frac{\sum_{j,k} p_{ij} p_{jk} p_{ik}}{\sum_{j,k} p_{ij} p_{ik}} \tag{4}$$

which can be read as the average number of triangles divided by the average number
of bond pairs. In modified form, this clustering coefficient has appeared in the very
recent literature [5] but without connection to a general approach to the construction of
weighted network measures based on a general mapping from weights to probabilities.
Note that $\tilde{c}_i$ is not the average of $c_i$ over the ensemble. For a detailed discussion of
this subtlety, see [15].

Figure 1. Example of the advantages of the ensemble clustering coefficient, as
shown in our earlier work [13]: The network of air travel passengers within the
25 member states of the EU [15] is almost fully connected. LEFT: Unweighted
clustering coefficient versus degree. All 25 data points are projected onto 7
locations, as a result of the information loss due to discarding the weights, and
because the network is almost fully connected. CENTER: Clustering coefficient as
proposed in the literature [2] versus strength. This "mixed" clustering coefficient is
a function of unweighted and weighted quantities. No clear relationship is evident,
again because the network is almost fully connected. RIGHT: Ensemble clustering
coefficient versus ensemble degree. Unlike in the other two panels, the quantities derived
using the ensemble approach exhibit a clear negative linear relationship. The
lines are lines of best fit. Note that the absolute scale of the ensemble clustering
coefficient depends on the choice of the map $M$ from weights to probabilities,
which makes the relative values of $\tilde{c}_i$ more important than the absolute ones.
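The ensemble clustering coefficient of eq. (4) can be computed directly from the matrix of probabilities. The following Python sketch is one straightforward way to do so, not necessarily how the results in this paper were produced; the function name `ensemble_clustering` is hypothetical, and returning 0 for a node with no bond pairs is an assumed convention.

```python
def ensemble_clustering(p):
    """Ensemble clustering coefficient of every node, following eq. (4).

    Both sums run over ordered pairs (j, k) with i, j and k all distinct.
    Assumption: a node whose denominator vanishes is assigned 0.0.
    """
    n = len(p)
    result = []
    for i in range(n):
        triangles = 0.0  # average number of triangles through node i
        pairs = 0.0      # average number of bond pairs at node i
        for j in range(n):
            for k in range(n):
                if len({i, j, k}) < 3:
                    continue
                triangles += p[i][j] * p[j][k] * p[i][k]
                pairs += p[i][j] * p[i][k]
        result.append(triangles / pairs if pairs > 0 else 0.0)
    return result
```

When the input is a binary adjacency matrix, the same function returns the unweighted clustering coefficient of eq. (3), which makes it easy to compare the two on the same network.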

All measures constructed with the ensemble approach are only functions of the
normalized weights $p_{ij}$, not of the elements of an unweighted adjacency matrix $a_{ij}$ or
of the degree $k$. This distinguishes the ensemble measures from measures proposed for
weighted networks in the literature, such as the weighted clustering coefficient $c^w_i$:

$$c^w_i = \frac{1}{s_i (k_i - 1)} \sum_{j,k} \frac{w_{ij} + w_{ik}}{2} a_{ij} a_{ik} a_{jk} \tag{5}$$

and the weighted average nearest-neighbour degree $k^w_{nn,i}$:

$$k^w_{nn,i} = \frac{1}{s_i} \sum_{j=1}^{N} a_{ij} w_{ij} k_j \tag{6}$$



Both are defined in [2], and eq. (5) is the most frequently cited definition of
a weighted clustering coefficient in the literature. Due to their construction, these
measures cannot be used for the analysis of fully connected weighted networks, as
$k^w_{nn,i} = N - 1$ and $c^w_i = 1$ for all nodes $i$ in such networks. Fully connected weighted
networks form an important class of complex networks, for example in the form of the
(virtually fully connected) EU air travel network which we analyze in [13] (see Fig.
1). Furthermore, any matrix of similarities or distances between a number of objects,
such as microarray data series in biological experiments, can be treated
as a fully connected weighted network, and thus can be analyzed using the ensemble
approach, but not with approaches such as eqs. (5) and (6), which are "mixed" in
the sense that they make use of both the unweighted and weighted adjacency matrix
entries.

Figure 2. Receiver-operating characteristic (ROC) diagrams for the yeast cell
cycle (LEFT) and somitogenesis (RIGHT) datasets, showing the positions of
known biologically significant genes in rankings of 200 genes
generated (a) using the ensemble clustering coefficient (solid) and (b) using
the original pattern-finding approach (dotted) which was used to select the 200
genes in the first place. In both cases the ensemble clustering coefficient moves
biologically significant genes to the top of the ranking.
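The degeneracy of the weighted clustering coefficient of eq. (5) on fully connected networks is easy to verify numerically. The Python sketch below assumes strictly positive weights on existing edges and takes an edge to exist wherever the weight is positive; the function name `weighted_clustering` and the example weights are hypothetical.

```python
def weighted_clustering(w):
    """Weighted clustering coefficient of eq. (5) for every node.

    Assumptions: an edge exists wherever w_ij > 0 (i != j), and nodes with
    fewer than two neighbours are assigned 0.0.
    """
    n = len(w)
    a = [[1 if i != j and w[i][j] > 0 else 0 for j in range(n)]
         for i in range(n)]
    result = []
    for i in range(n):
        k_i = sum(a[i])                                  # degree
        s_i = sum(w[i][j] for j in range(n) if j != i)   # strength
        if k_i < 2:
            result.append(0.0)
            continue
        total = 0.0
        for j in range(n):
            for k in range(n):
                if len({i, j, k}) < 3:
                    continue
                total += 0.5 * (w[i][j] + w[i][k]) * a[i][j] * a[i][k] * a[j][k]
        result.append(total / (s_i * (k_i - 1)))
    return result


# A fully connected network with very different weights on its edges.
w_full = [[0, 1, 5, 2], [1, 0, 3, 7], [5, 3, 0, 4], [2, 7, 4, 0]]
coefficients = weighted_clustering(w_full)
```

Every entry of `coefficients` equals 1 up to rounding, whatever the weights, so this measure cannot distinguish between the nodes of a fully connected network.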

Note that the absolute values of the ensemble clustering coefficient have limited 
meaning, as they are dependent on the map M. It is their relative values which carry 
the information, and these are largely independent of the choice of map M, as long as 
it is bijective. 

Microarrays are one of the most successful high-throughput technologies in 
biology, providing a snapshot of gene expression levels for all of the thousands of 
genes in the genome of a given organism simultaneously. A microarray consists of a 
large number of microscopic spots on a slide (typically made of glass or silicon), which 
each contain copies of a different short DNA sequence (or oligonucleotide) unique to 
a particular gene. Furthermore, the sequence copies in each spot are attached to a 
fluorescent marker. If a given gene is expressed in the tissue sample to be examined,
many copies of this gene will be present in the form of messenger RNA (mRNA),
which in turn will bind to the sequences on the microarray, causing fluorescence of the
spot. The fluorescence of the array of spots is captured by a camera and then read
out using a computer. 

A series of microarray measurements gives an expression profile for each gene over 
space or time, telling us where and when a given gene is 'switched on'. These sets of 
data series are subjected to detailed analysis, and distance matrices between these
series (often calculated using Pearson correlation) typically form an integral part of
such an analysis.
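As an illustration of how such a distance matrix can be built, the following Python sketch computes a Pearson-based distance between expression series. Taking the distance to be one minus the correlation is a common convention but an assumption here, and the function names are hypothetical. It is not the distance measure used for the results reported below.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long series.

    A constant series has zero variance and would cause a division by zero.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_distance_matrix(series):
    """Distance d = 1 - r between every pair of series (assumed convention)."""
    n = len(series)
    return [[0.0 if i == j else 1.0 - pearson(series[i], series[j])
             for j in range(n)]
            for i in range(n)]
```

With this convention perfectly correlated series are at distance 0 and perfectly anticorrelated series at distance 2, so the resulting matrix is fully connected in the network picture.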

Here we calculate the ensemble clustering coefficient for distance matrices 
derived from two entirely different microarray data sets. The first data set 
consists of microarray data from an experiment studying the formation of vertebrae
(somitogenesis) in mice [16], from which a list of 200 genes was compiled using
an existing pattern detection approach [17]. This approach is designed to detect
biologically significant genes by finding expression profiles which deviate from
randomness. The second data set is the well-known set of yeast cell cycle
microarray experiments [18]. Here too the 200 strongest patterns were selected
using the same approach.

It should be noted that microarray datasets are notoriously noisy, and pre-filtering
of data based on purely mathematical measures is essential and in fact present in
almost any microarray study. Our selection method based on pattern detection is
mathematically rigorous and makes no prior assumptions about the nature of the 
pattern. 

In each of the two datasets the 200 genes are ranked by the amount of pattern they 
contain (and thus by their supposed biological significance). Yet the fully connected 
weighted network which corresponds to a distance matrix between these 200 genes 
contains none of this information. Therefore, when we calculate the ensemble clustering 
coefficient for a distance matrix of 200 genes, we can use the pattern-detection 
approach as a benchmark comparison for the performance of the clustering coefficient 
in finding biologically significant genes. 

For both the mouse somitogenesis and yeast cell cycle datasets we compare 
our predictions to lists of known biologically significant genes. In the case of mouse 
somitogenesis these are 17 genes associated with the Wnt and Notch pathways, listed 
in [16], and in the case of the yeast cell cycle these are 65 genes which can be found in
two lists of experimentally verified yeast cell cycle genes [19] [20].

The distance measure chosen to generate the distance matrix is the algorithmic
compression of one expression series due to another [17]. As can be seen in Fig. 2, the
ranking generated by using the clustering coefficient clearly outperforms the pattern
ranking for both datasets. In the case of the mouse somitogenesis dataset, 11 (64%) of
the 17 genes known to play a role in somitogenesis are located in the top 13 places (top 
6%) of the ranking. Similarly, in the yeast cell cycle dataset, 31 (48%) of 65 known 
genes occupy places in the top 43 (top 21%). Compared to this, the conventional 
pattern-finding approach fares less well, with 6 (35%) in the top 13 (somitogenesis) 
and 23 (35%) in the top 43 (yeast). The conclusion is that in both datasets the 
ensemble clustering coefficient appears to move biologically significant genes to the 
top of the ranking. 
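The comparison of rankings reported above amounts to counting known genes near the top of a list and, for the ROC diagrams, accumulating true and false positive rates along the ranking. A minimal Python sketch of both steps follows; the function names are hypothetical, and the code assumes the ranking contains at least one known and one unknown gene.

```python
def hits_in_top(ranking, known, k):
    """Number of known genes among the first k entries of a ranking."""
    known = set(known)
    return sum(1 for gene in ranking[:k] if gene in known)


def roc_points(ranking, known):
    """(false positive rate, true positive rate) after each ranked gene.

    Assumption: the ranking holds at least one known and one unknown gene.
    """
    known = set(known)
    positives = sum(1 for gene in ranking if gene in known)
    negatives = len(ranking) - positives
    true_pos = false_pos = 0
    points = [(0.0, 0.0)]
    for gene in ranking:
        if gene in known:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points
```

A ranking that places all known genes first produces a curve that rises to a true positive rate of 1 before the false positive rate leaves 0, which is the behaviour a good ranking should approach.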

By transforming a weighted network into an ensemble network, any of the 
numerous measures which have been defined for unweighted networks can be 
straightforwardly generalized to weighted networks. As we have shown in this
paper, our approach is particularly suited to the analysis of distance matrices. We
have demonstrated this by calculating the ensemble clustering coefficient for the distance
matrices between microarray data series, which successfully identifies many known
biologically significant genes. These results are an indication that the application of
complex networks methods to the rather separate field of distance matrix analysis is 
likely to yield valuable insights. 